
One zarr to rule them all #12

Merged: 21 commits into main, Apr 22, 2024
Conversation


@sadamov sadamov commented Mar 26, 2024

This PR simplifies the dataloader and input-data creation.
New features

  • create_zarr_archive.py now generates one huge zarr archive directly from the COSMO-2 GRIB files. @cosunae this might be of interest for offline data preparation.
  • weather_dataset.py now loads one zarr archive, and __getitem__ now also returns the datetimes of the current batch.
  • In ar_model.py the datetimes of the batch are compared to constants.EVAL_DATETIMES, which massively simplifies model testing and prediction. @clechartre this is certainly relevant for prediction/verification.
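To illustrate the second bullet, here is a minimal stdlib-only sketch of a dataset whose __getitem__ returns the datetimes alongside the sample. All names (WeatherDatasetSketch, timestep_h, sample_len) and the datetime format are hypothetical stand-ins, not the actual weather_dataset.py API:

```python
from datetime import datetime, timedelta

class WeatherDatasetSketch:
    """Sketch: maps a sample index to the datetimes that sample covers.

    In the real dataset the sample itself would be sliced lazily from the
    zarr archive; here the data is a placeholder and only the datetime
    bookkeeping is shown.
    """

    def __init__(self, start, n_timesteps, timestep_h=1, sample_len=3):
        self.start = start                         # first analysis time
        self.timestep = timedelta(hours=timestep_h)
        self.sample_len = sample_len               # timesteps per sample
        self.n_samples = n_timesteps - sample_len + 1

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        times = [self.start + (idx + k) * self.timestep
                 for k in range(self.sample_len)]
        sample = None  # placeholder for the actual data tensor
        return sample, [t.strftime("%Y%m%d%H") for t in times]
```

Returning the datetimes with each batch is what lets downstream code (e.g. the evaluation logic in ar_model.py) decide what to do with a batch without re-deriving its position in the archive.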

Reasoning

  • After discussions within ECMWF/Neural-LAM it became pretty clear that a single zarr archive is the way to go. Lazy loading and parallelization allow for huge archives. The reason it didn't work for me last year was most likely the issues we had with /scratch being extremely slow and unresponsive. It's also much easier to share dataloaders and datasets this way.
  • Verification of the model is much simpler now that the datetime of each batch is tracked. Model predictions can still be exported easily using constants.STORE_EXAMPLE_DATA.
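The datetime-based verification described above can be sketched in a few lines. EVAL_DATETIMES, the "%Y%m%d%H" string format, and the function name are illustrative assumptions, not the actual constants.py contents:

```python
# Hypothetical stand-in for constants.EVAL_DATETIMES: the analysis times
# at which evaluation/prediction output should be produced.
EVAL_DATETIMES = {"2020010112", "2020010212"}

def should_evaluate(batch_datetimes):
    """Return True if any datetime in the batch is flagged for evaluation.

    Comparing the batch's own datetimes against a fixed set replaces any
    fragile index arithmetic over the dataset.
    """
    return any(t in EVAL_DATETIMES for t in batch_datetimes)
```

A set membership test keeps the check O(1) per timestep regardless of how many evaluation dates are configured.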

Notes

  • Zarr creation is ongoing; it takes roughly 3-4 days, provided Balfrin does not crash.
  • For now I have created a small cosmo_single dataset that can be used for training and evaluation.
  • A new dummy example.ckpt was uploaded for evaluation/testing (trained on a very small dataset).
  • It might make sense to wait for the full zarr creation and retraining of the model before this PR is merged, but I wanted to share the code already to prevent duplicated work.

@sadamov sadamov requested a review from twicki March 26, 2024 15:34
@sadamov sadamov mentioned this pull request Apr 11, 2024
@sadamov sadamov requested a review from clechartre April 16, 2024 07:55

sadamov commented Apr 16, 2024

Okay, this PR is ready for review and merge. I have copied @cosunae's latest single zarr to balfrin.cscs.ch and updated the code accordingly. All zarr-creation scripts are now removed from the repo. The new example file is trained on a few timesteps, just for debugging. I suggest doing the following before merging:

  • @twicki can you execute slurm_train.py and slurm_eval.py and make sure they work?
  • @clechartre can you execute slurm_predict.py and cli_plotting.py and make sure everything is still as you wanted it to be?
    If you have additional comments on the new functionality, you are very welcome to add them to your review.

@clechartre clechartre left a comment

This code is functional for inference and prediction plotting.

Pinning PyTorch to a specific version was required because
otherwise CPU-only variants would be installed on some systems
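A pin of this kind typically looks like the following requirements fragment. The exact version and index URL below are illustrative, not the ones used in this repo:

```
# requirements.txt sketch: pin torch and point pip at a CUDA build index
# so that the CPU-only wheel is never silently selected.
--extra-index-url https://download.pytorch.org/whl/cu118
torch==2.0.1
```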
@sadamov sadamov merged commit 22cddcb into main Apr 22, 2024
1 check passed
@sadamov sadamov deleted the one_zarr branch April 22, 2024 09:32